Abstract. The Support Vector Machine (SVM) method has been widely
used in numerous classification tasks. The main idea of this algorithm is
based on the principle of margin maximization: it seeks a hyperplane
that separates the data into two classes. In this paper, SVM is applied
to a phoneme recognition task. However, in many real-world problems,
the phonemes in a data set may differ in their degree of significance
because of noise, inaccuracies, or abnormal characteristics, and these
problems can lead to inaccuracies in the prediction phase. Unfortunately,
the standard formulation of SVM does not take these problems into
account, in particular the variation in the speech input.

This paper presents a new formulation of SVM (B-SVM) that assigns
to each phoneme a confidence degree computed from its geometric
position in the feature space. This degree is then used to strengthen
the class membership of the tested phoneme. Hence, we introduce a
reformulation of the standard SVM that incorporates this degree of belief.
Experiments on the TIMIT database show the effectiveness of the
proposed B-SVM method on a phoneme recognition problem.
1 Introduction

The Support Vector Machine (SVM) was first introduced by Vladimir Vapnik [2]
for binary classification tasks, in order to construct decision functions in
the input space based on the theory of Structural Risk Minimization [3].
Afterwards, SVM was extended to support both multi-class classification
and regression tasks. SVM constructs one or several hyperplanes in order
to separate the data into the different classes. Among all separating
hyperplanes, an optimal one must be found in order to separate the data
accurately into two classes.

[3] defined the optimal hyperplane as the decision function with maximal margin.
The margin is the shortest distance between the separating hyperplane and the
closest vectors of the two classes. The application of SVM to automatic speech
recognition (ASR) has shown competitive performance and accurate recognition
rates. In the sound system of a language, a phoneme is the smallest distinctive
unit that is able to convey a difference in meaning. Thus, the success of the
phoneme recognition task is important to the development of language systems.
Nevertheless, during the signal acquisition process, the speech signal may be
affected by speaker characteristics such as gender, accent, and style of speech.
Other factors can also have an impact on speech recognition, such as the noise
coming from a microphone or the variation in the vocal tract shape.

The standard formulation of SVM may not determine accurately the identity
of the tested phoneme. Indeed, the speech signal is subject to all sorts of
unwanted variations during acquisition. These variations degrade the
recognition rates because the recognition mechanism does not take such
changes in the phoneme data into account. For example, in real applications,
differences in English pronunciation and accent may significantly increase
the error rate of any learning algorithm that handles all phoneme data
identically. The standard SVM thus finds an optimal hyperplane without
considering the variations that accompany the speech signals, and the
resulting hyperplane can lead to a loss of accuracy.

In this paper, we propose a novel approach that incorporates a belief function
into the standard SVM algorithm by integrating a confidence degree for each
phoneme. To obtain this new formulation, we first compute the geometric
distance between the tested phoneme and the center of each possible class.
The benefit of hybrid approaches lies in their support for decision-making
and their ability to confirm the robustness of the recognition system [12].
The experimental results on the phoneme data sets drawn from the TIMIT
database show that B-SVM outperforms the standard SVM and produces better
recognition rates. The rest of this paper is organized as follows: Section 2
presents an overview of Support Vector Machines (SVM). Section 3 presents
the steps of phoneme processing and the problems that accompany speech
processing. Section 4 presents the new B-SVM formulation. Section 5 describes
the hierarchical phoneme recognition system. Section 6 presents the
experimental results and a comparison between the standard SVM and B-SVM
on a multi-class phoneme recognition problem. The final section is the
conclusion.

2 Support Vector Machines 

The Support Vector Machine (SVM) is a learning algorithm for pattern
recognition and regression problems [5] that approaches the classification
problem as an approximate implementation of the Structural Risk
Minimization (SRM) induction principle [3].

SVM approximates the solution to the minimization problem of SRM through 
a Quadratic Programming optimization. It aims to maximize the margin which 
is the distance from a separating hyperplane to the closest positive or negative 
sample between classes. 

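Hence, for a pair of classes (i, j), the hyperplane that optimally separates
the data is the one that minimises:

\[ \min_{w^{ij},\, b^{ij},\, \xi^{ij}} \; \frac{1}{2}\,(w^{ij})^{T} w^{ij} + C \sum_{t=1}^{m} \xi_t^{ij} \tag{1} \]

where C is a penalty on errors and $\xi_t^{ij}$ is a non-negative slack
variable which measures the degree of misclassification, subject to the
constraints:

\[
\begin{aligned}
(w^{ij})^{T}\phi(x_t) + b^{ij} &\ge 1 - \xi_t^{ij}, && \text{if } y_t = i, \\
(w^{ij})^{T}\phi(x_t) + b^{ij} &\le -1 + \xi_t^{ij}, && \text{if } y_t = j, \\
\xi_t^{ij} &\ge 0.
\end{aligned}
\tag{2}
\]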

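For phoneme classification, the decision function of SVM is expressed as:

\[ f(x) = \operatorname{sign}\Big( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \Big) \tag{3} \]

where $y_i \in \{-1, +1\}$ encodes the two classes of the binary problem.
The argument of the sign function gives a signed distance, up to a scaling
factor, from a phoneme x to the hyperplane.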
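As an illustration, the decision function (3) can be evaluated directly once the support vectors, multipliers, labels, and bias are known. The following is a minimal sketch, not the implementation used in the experiments: the names `decision_value` and `classify`, the linear kernel, and the toy values are assumptions chosen so that the arithmetic can be checked by hand.

```python
def linear_kernel(a, b):
    """Plain dot product; used only to keep the toy example easy to verify."""
    return sum(p * q for p, q in zip(a, b))


def decision_value(x, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Return sum_i alpha_i * y_i * K(x_i, x) + b, the argument of sign in Eq. (3)."""
    total = sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
    return total + bias


def classify(x, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(x, support_vectors, alphas, labels, bias, kernel)
    return 1 if value >= 0 else -1


# Hypothetical toy model: two support vectors, one per class.
SUPPORT_VECTORS = [[1.0, 1.0], [-1.0, -1.0]]
ALPHAS = [0.25, 0.25]
LABELS = [1, -1]
BIAS = 0.0
```

With these toy values the weight vector is (0.5, 0.5), so both support vectors lie exactly on the margin, which is the situation the multipliers are meant to describe.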

However, when the data set is not linearly separable, finding the parameters
of this decision function becomes a quadratic programming problem. Casting
this optimization problem into its Lagrange functional and introducing the
Lagrange multipliers $\alpha_i$, we obtain the dual objective function:

\[ L_d = \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j). \tag{4} \]

where $K(x_i, x_j)$ is the kernel evaluated on the data $x_i$ and $x_j$, and
the coefficients $\alpha_i$ are the Lagrange multipliers, computed for each
phoneme of the data set. The dual is maximised with respect to
$\alpha_i \ge 0$. The data with nonzero coefficients $\alpha_i$ are called
support vectors; they determine the decision boundary hyperplane of the
classifier.
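For readers who want the intermediate step, the dual objective follows from the usual stationarity conditions of the soft-margin Lagrangian. The sketch below assumes a single binary problem with labels $y_i \in \{-1, +1\}$ and a feature map $\phi$ with $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$; the multipliers $\mu_i$ for the slack constraints are notation introduced here only for illustration.

```latex
\begin{aligned}
L(w, b, \xi, \alpha, \mu) &= \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_i
  - \sum_{i=1}^{m}\alpha_i\big[y_i\,(w^{T}\phi(x_i) + b) - 1 + \xi_i\big]
  - \sum_{i=1}^{m}\mu_i\,\xi_i, \\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{m}\alpha_i y_i\,\phi(x_i), \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{m}\alpha_i y_i = 0, \\
\frac{\partial L}{\partial \xi_i} = 0 \;&\Rightarrow\; C - \alpha_i - \mu_i = 0
  \;\Rightarrow\; 0 \le \alpha_i \le C .
\end{aligned}
```

Substituting the expression for $w$ back into $L$ removes $w$, $b$, and $\xi$, and leaves the sum of the multipliers minus one half of the kernel-weighted double sum.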

Moreover, applying the kernel trick, which maps an input vector into a
higher-dimensional feature space, allows SVM to approximate a non-linear
function [5], [7]. In this paper, we use SVM with the radial basis function
(RBF) kernel. This kernel was chosen after a case study carried out to find
the kernel, and the parameters, with which SVM achieves good generalization
performance. Following the SRM principle, SVM adopts a systematic approach
to find a linear function that belongs to a set of functions with the lowest
VC dimension (the Vapnik-Chervonenkis dimension measures the capacity of a
statistical classification algorithm).
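To make the kernel choice concrete, the RBF kernel in its usual form is $K(a, b) = \exp(-\gamma \|a - b\|^2)$. The snippet below is a small sketch of that formula; the function name `rbf_kernel` is an assumption, and the default width of 1/117 is simply the value listed in the experimental setup.

```python
import math

GAMMA = 1.0 / 117.0  # kernel width taken from the experimental setup table


def rbf_kernel(a, b, gamma=GAMMA):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-gamma * squared_distance)
```

The kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart, so a smaller gamma makes distant phonemes look more similar.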
3 Phoneme processing


Speech recognition is the process of converting an acoustic signal, captured by
a microphone, to a set of words, syllables or phonemes. Speech recognition
systems can be used for applications such as mobile applications, commands,
control, data entry, and document preparation. The steps of speech processing
are described in Figure 1.

Fig. 1. Phoneme processing steps.


Phoneme processing consists, first, of converting the speech captured by a
microphone into a sequence of feature vectors. Then, a segmentation step
converts the continuous speech signal into a set of units such as phonemes.
Once the training and test data sets are prepared, a classifier is applied to
classify the unknown phonemes. However, phoneme recognition systems are
affected by many parameters and problems that make the recognition task
more difficult. These factors cannot be taken into account by the classifier,
since they are part of the captured speech.

In fact, speech contains disfluencies and periods of silence, which are much
more difficult for the classifier to recognise than speech periods. On the
other hand, a speaker is not able to say a phrase in the same or a similar
manner each time, so phoneme recognition systems struggle to learn to
recognize each phoneme correctly. The speaker's voice quality, such as volume
and pitch, and breath control should also be taken into account, since they
distort the speech. Hence, these physiological elements must be taken into
account in order to construct a robust phoneme recognition system.

Regrettably, the classifier is not able to take into account, during the
recognition process, all those factors that are inherent in the speech signal,
which may lead to confusion between phonemes. In this paper, we propose to
incorporate a confidence degree that helps the standard SVM classifier to
find the optimal hyperplane and to classify each phoneme into its class.













Incorporating Belief Function in SVM for Phoneme Recognition 




4 Belief SVM (B-SVM) 

The formulation of the proposed B-SVM method is described in three steps. The
first step computes the Euclidean distance $d(C_{Y_i}, X_i)$ between the
center of each class and the phoneme $X_i$ to be classified. The second
step computes the confidence degree of the membership of the phoneme $X_i$
in the class $Y_i$. Finally, those confidence degrees are incorporated into
SVM to help find the optimal hyperplane.


4.1 Geometric distance 

We propose to calculate the geometric distance between $X_i$ and the center
$C_{Y_i}$ of each class, where $i \in \{1, \ldots, k\}$. We consider that the
phoneme $X_i$ may belong to any one of the classes $Y$. The geometric
distance, noted $d(C_{Y_i}, X_i)$, is calculated using the Euclidean distance.

The highest value of $d(C_{Y_i}, X_i)$ corresponds to the class $Y$ most
distant from the phoneme $X_i$, and the lowest value to the class closest to
the phoneme $X_i$.
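The distance step can be sketched in a few lines. The text does not state how a class center is obtained, so the example below assumes it is the mean of the training vectors of that class; the function names and the toy data are likewise illustrative assumptions.

```python
import math


def class_centers(samples, labels):
    """Return {label: mean vector}; the mean as class center is an assumption."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def distances_to_centers(x, centers):
    """Return {label: d(center, x)} for every class center."""
    return {y: euclidean(c, x) for y, c in centers.items()}


# Hypothetical two-dimensional toy data for two phoneme classes.
SAMPLES = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
LABELS = ["aa", "aa", "ih", "ih"]
```

For a test vector such as `[4.0, 4.0]`, the distance to the center of "aa" is smaller than the distance to the center of "ih", which matches the ordering described above.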


4.2 Confidence degree 

This step consists of calculating the confidence degree $m_i(X_i)$ of each
phoneme $X_i$. It expresses the possibility that $X_i$ belongs to the class
$Y$. The following algorithm generates the confidence degree for each phoneme:

Calculate confidence degrees m_i(X_i)

begin

    Set of phoneme samples with labels {(X_1, Y_1), ..., (X_n, Y_k)};
    Initialize the confidence degree m_i of the samples:

        1 if X_i is in the ith class, 0 otherwise;

    C_i := center of the ith class;
    m_i(X_i) := 1 / d(C_i, X_i)
end.
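The inverse-distance rule of the algorithm can be written as a short function. This is a sketch under stated assumptions: the name `confidence_degrees` is hypothetical, and the small `eps` guard against a zero distance (a phoneme lying exactly on a class center) is an addition that the algorithm above does not specify.

```python
import math


def confidence_degrees(x, centers, eps=1e-12):
    """Return {label: 1 / d(center, x)} for each class center.

    eps avoids a division by zero when x coincides with a center;
    this guard is an assumption, not part of the original algorithm.
    """
    degrees = {}
    for label, center in centers.items():
        distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(center, x)))
        degrees[label] = 1.0 / max(distance, eps)
    return degrees


# Hypothetical class centers for two phoneme classes.
CENTERS = {"aa": [1.0, 0.0], "ih": [11.0, 0.0]}
```

A phoneme close to a class center receives a large confidence degree for that class, and the degree shrinks as the phoneme moves away from the center.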


4.3 Formulation of belief SVM 

When the data are not linearly separable and the problem is multi-class,
SVM constructs $k(k-1)/2$ classifiers for the training data set, where k is
the number of classes. The one-against-one approach is used to convert the
multi-class problem into multiple binary problems.
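The one-against-one decomposition can be illustrated by enumerating the class pairs. In the sketch below the seven class names are those reported in the results table, while the majority-vote rule used to combine the pairwise decisions is an assumption (it is the common choice, but the text does not state which rule is used); `class_pairs` and `vote` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# The seven broad phoneme classes reported in the experimental results.
PHONEME_CLASSES = [
    "vowels", "occlusives", "nasals", "fricatives",
    "semi-vowels", "silences", "affricates",
]


def class_pairs(classes):
    """Return every unordered pair of classes: k * (k - 1) / 2 pairs."""
    return list(combinations(classes, 2))


def vote(pairwise_winners):
    """Return the class that wins the most pairwise decisions (assumed rule)."""
    return Counter(pairwise_winners).most_common(1)[0][0]
```

With seven classes this yields 21 binary classifiers, each trained only on the samples of its two classes.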

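In the proposed B-SVM, we incorporate the confidence degree of each phoneme
sample into the constraints, since the identity of the predicted class is not
affected by multiplication by a positive scalar. We normalize the hyperplane
to satisfy:

\[
\begin{aligned}
m(x_t)\,\big[(w^{ij})^{T}\phi(x_t) + b^{ij}\big] &\ge 1 - \xi_t^{ij}, && \text{if } y_t = i, \\
m(x_t)\,\big[(w^{ij})^{T}\phi(x_t) + b^{ij}\big] &\le -1 + \xi_t^{ij}, && \text{if } y_t = j, \\
\xi_t^{ij} &\ge 0.
\end{aligned}
\tag{5}
\]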

In fact, incorporating the confidence degree relaxes the constraints when
the phoneme has a high confidence degree for the class.

On the other hand, the dual representation of the standard SVM maximises
over the multiplier $\alpha_i$ of each phoneme. Thus, with a high value of the
confidence degree, the condition $\alpha_i > 0$ can be easily satisfied, so
that the phoneme is treated as a support vector that helps to determine the
hyperplane.

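In the proposed B-SVM, we optimize this formulation to obtain a new dual
representation:

\[ L_d = \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} m(x_i)\, m(x_j)\, \alpha_i \alpha_j y_i y_j\, \phi(x_i)^{T} \phi(x_j). \tag{6} \]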
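One way to see where the product of confidence degrees in the dual comes from is to repeat the standard Lagrangian argument with the scaled constraint. The derivation below is a sketch under an assumption: it takes a single binary problem with labels $y_i \in \{-1, +1\}$ and supposes that the confidence degree multiplies the decision value in the constraint, $y_i\, m(x_i)\,(w^{T}\phi(x_i) + b) \ge 1 - \xi_i$.

```latex
\begin{aligned}
L(w, b, \xi, \alpha, \mu) &= \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{m}\xi_i
  - \sum_{i=1}^{m}\alpha_i\big[y_i\, m(x_i)\,(w^{T}\phi(x_i) + b) - 1 + \xi_i\big]
  - \sum_{i=1}^{m}\mu_i\,\xi_i, \\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\;
  w = \sum_{i=1}^{m} m(x_i)\,\alpha_i y_i\,\phi(x_i), \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\;
  \sum_{i=1}^{m} m(x_i)\,\alpha_i y_i = 0, \\
\frac{1}{2}\|w\|^{2} &= \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
  m(x_i)\, m(x_j)\,\alpha_i\alpha_j y_i y_j\,\phi(x_i)^{T}\phi(x_j).
\end{aligned}
```

Under this assumption, each support vector enters the weight vector scaled by its confidence degree, which is why the degrees appear pairwise in the quadratic term and once per term in the decision function.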

In the standard SVM, the class $Y$ of a phoneme $X$ is determined by the sign
of the decision function. In the proposed B-SVM, the new decision function
thus becomes:

\[ \sum_{i=1}^{m} m(x_i)\, \alpha_i y_i\, \phi(x_i)^{T} \phi(x) + b \tag{7} \]

The sign of this new decision function determines the class into which the
phoneme is classified.
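The effect of the confidence degrees on the decision value can be illustrated with a toy computation. The sketch below assumes that each term of the sum is weighted by the confidence degree of its support vector; the function name, the linear kernel, and all numeric values are hypothetical and chosen only so the result can be checked by hand.

```python
def linear_kernel(a, b):
    """Plain dot product; stands in for the feature-space inner product."""
    return sum(p * q for p, q in zip(a, b))


def bsvm_decision_value(x, support_vectors, alphas, labels, confidences,
                        bias, kernel=linear_kernel):
    """Return sum_i m_i * alpha_i * y_i * K(x_i, x) + b."""
    total = sum(
        m * alpha * y * kernel(sv, x)
        for sv, alpha, y, m in zip(support_vectors, alphas, labels, confidences)
    )
    return total + bias


# Hypothetical toy model: two support vectors, one per class.
SUPPORT_VECTORS = [[1.0, 1.0], [-1.0, -1.0]]
ALPHAS = [0.25, 0.25]
LABELS = [1, -1]
BIAS = 0.0
```

With all confidence degrees equal to 1 the expression reduces to the standard SVM decision value; raising the degree of one support vector increases its pull on the result.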

5 Hierarchical phoneme recognition system 

The architecture of our hierarchical phoneme recognition system is described
in Figure 2.

The recognition system proceeds as follows: (1) conversion of the speech
waveform to a spectrogram; (2) transformation of the spectrogram to a
Mel-frequency cepstral coefficients (MFCC) spectrum using spectral analysis;
(3) segmentation of the phoneme data sets into sub-phoneme data sets;
(4) phoneme recognition at the first level of the system, using B-SVM to
recognize the class of the unknown phoneme (vowel or consonant); (5) finally,
phoneme recognition at the second level of the system, using B-SVM to
recognize the identity of the unknown phoneme (e.g. aa, ae, ih, etc.).
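The two-level dispatch of steps (4) and (5) can be sketched as follows. Everything here is illustrative: `recognize` is a hypothetical name, and the classifiers are passed in as plain callables standing in for trained first-level and second-level models.

```python
def recognize(features, level1_classifier, level2_classifiers):
    """Two-level recognition: broad class first, then phoneme identity.

    level1_classifier: callable mapping features -> broad class name.
    level2_classifiers: dict mapping each broad class name to a callable
        that maps features -> phoneme label within that class.
    """
    broad_class = level1_classifier(features)
    phoneme = level2_classifiers[broad_class](features)
    return broad_class, phoneme


# Hypothetical stand-ins for trained models, for demonstration only.
def toy_level1(features):
    return "vowels" if features[0] > 0 else "nasals"


TOY_LEVEL2 = {
    "vowels": lambda features: "aa" if features[1] > 0 else "ih",
    "nasals": lambda features: "m" if features[1] > 0 else "n",
}
```

A call such as `recognize([1.0, -1.0], toy_level1, TOY_LEVEL2)` first selects the vowel branch and then picks a phoneme inside it, so a second-level model only has to separate phonemes of one broad class.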

For the proposed recognition system, we used the Mel-frequency cepstral
coefficients (MFCC) feature extractor in order to convert the speech
waveform to a parametric representation.

Fig. 2. Hierarchical phoneme recognition system.

Davis and Mermelstein were the first to introduce the MFCC concept for
automatic speech recognition [5]. The main idea of this algorithm is that
the MFCCs are the cepstral coefficients calculated from the mel-frequency
warped Fourier transform representation of the log magnitude spectrum. The
Delta and Delta-Delta cepstral coefficients are estimates of the first and
second time derivatives of the MFCCs. Including these temporal cepstral
derivatives aims to improve the performance of the speech recognition system.
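As a rough illustration of how derivative features extend a static feature vector, the sketch below estimates the Delta and Delta-Delta coefficients with a two-point central difference and replicated edges. The exact estimator used in the experiments is not given in the text, so this formula, the function names, and the split of the 39 features into 13 static, 13 Delta, and 13 Delta-Delta coefficients are assumptions.

```python
def delta(frames):
    """Central difference over time: d[t] = (c[t+1] - c[t-1]) / 2.

    frames is a list of equal-length coefficient vectors, one per frame;
    the first and last frames are replicated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return out


def with_deltas(frames):
    """Append Delta and Delta-Delta coefficients to every frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]
```

Applied to 13 static coefficients per frame, `with_deltas` returns 39 values per frame, which matches the feature count listed in the experimental setup.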

Those coefficients have shown a strong capability to capture the transitional
characteristics of the speech signal, which can help to improve the
recognition task. The experiments using SVM were carried out with the LibSVM
toolbox. Table 1 summarizes our main experimental conditions:


Table 1. Experimental setup 


Parameter             Value
Method                SVM
γ                     1/117
Cost                  10
Kernel trick          RBF
Windowing             3-middle aligned windows
Corpus                TIMIT
Dialect               New England
Frame rate            125/s
Features technique    MFCC
Features number       39
Sampling frequency    16 kHz
It should be noted that, for the nonlinear B-SVM method, we chose the
RBF kernel and the one-against-one strategy to carry out multi-class SVM
classification. Furthermore, the input speech signal is segmented into frames
of 16 ms with an optional overlap of 1/3 to 1/2 of the frame size: with a
sample rate of 16 kHz and a frame size of 256 sample points, the frame
duration is 16 ms. In addition, the frame rate is 125 frames per second. Each
frame is multiplied by a Hamming window in order to keep the continuity of
the first and last points in the frame.
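The framing figures can be checked with a short sketch. A frame of 256 samples at 16 kHz lasts 16 ms, and a rate of 125 frames per second corresponds to a hop of 128 samples, that is, an overlap of half a frame; this hop value is inferred from the stated numbers rather than given explicitly. The function names are hypothetical, and the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import math

SAMPLE_RATE = 16000   # Hz
FRAME_SIZE = 256      # samples per frame
HOP = 128             # samples between frame starts (assumed 1/2 overlap)


def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_into_frames(signal, frame_size=FRAME_SIZE, hop=HOP):
    """Cut the signal into overlapping frames and apply a Hamming window."""
    window = hamming(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = signal[start:start + frame_size]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


FRAME_DURATION_MS = 1000 * FRAME_SIZE / SAMPLE_RATE   # frame length in ms
FRAME_RATE = SAMPLE_RATE / HOP                        # frames per second
```

The window tapers both ends of each frame towards a small value, which limits the discontinuity between the first and last samples when the frame is analysed in the frequency domain.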

6 Experimental results 

Table 2 shows the prediction accuracies at both the first and second levels
of the hierarchical recognition system, using seven different phoneme classes.


Table 2. Accuracies of B-SVM and standard SVM 


               B-SVM                                 Standard SVM
Method         Acc. (%)  Precision (%)  Recall (%)   Acc. (%)  Precision (%)  Recall (%)
Level 1        95        97             92           93        89             88
Level 2        84        83             80           78        75             73
Vowels         83        86             82           76        77             71
Occlusives     85        88             82           82        86             81
Nasals         80        78             69           75        63             60
Fricatives     87        76             78           83        69             70
Semi-vowels    87        91             91           84        91             87
Silences       83        71             69           75        62             68
Affricates     83        88             88           71        78             77

To investigate the accuracy of the proposed B-SVM method, we applied the
standard SVM and B-SVM to the TIMIT database. For prediction, we used test
samples that were not included in the training stage. Comparing the two
methods, we note that the accuracy of B-SVM is better than that of the
standard SVM for all data sets used.

The results in Table 2 thus show that the proposed B-SVM improves over the
standard SVM: accuracy rises from 93% to 95% at the first level and from 78%
to 84% at the second level.
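The size of the improvement can be read off the accuracy columns of Table 2. The short sketch below recomputes the accuracy gain, in percentage points, for each row; the numbers are copied from the table, and the variable names are illustrative.

```python
# Accuracy (%) per row of Table 2: (B-SVM, standard SVM).
ACCURACY = {
    "Level 1": (95, 93),
    "Level 2": (84, 78),
    "Vowels": (83, 76),
    "Occlusives": (85, 82),
    "Nasals": (80, 75),
    "Fricatives": (87, 83),
    "Semi-vowels": (87, 84),
    "Silences": (83, 75),
    "Affricates": (83, 71),
}

# Gain of B-SVM over the standard SVM, in percentage points.
GAINS = {row: bsvm - svm for row, (bsvm, svm) in ACCURACY.items()}

LARGEST_GAIN_ROW = max(GAINS, key=GAINS.get)
```

The gains range from 2 points at the first level to 12 points for the affricates, and every row shows a positive gain in accuracy.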

7 Conclusion 

In this paper, we have proposed a new formulation of SVM that uses a
confidence degree for each object. We have also built a hierarchical phoneme
recognition system.

The new B-SVM method seems to be more effective than the standard SVM
for all tested data sets. The new formulation of SVM succeeded in improving
phoneme recognition because the allocation of a belief weight to each phoneme
makes it possible to model the similarity between phonemes and thus to
reduce the confusion between phonemes.